Data Analysis Project on White Wines by Raul Armendariz

The data set that I’ll be working with comes from a study by Cortez et al., “Modeling wine preferences by data mining from physicochemical properties”. The data from the study is split into two data sets, one for white wines and one for red wines. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). This analysis focuses on the white wine data set: we will see what findings we can gather from the data and whether any of the input variables are correlated with the quality scores that the experts gave the wines.

Now, let’s load the data set and look at its structure. We’ll also check for null values just in case. If we find any, we can work around them in the upcoming analysis.

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
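
As a minimal sketch of the kind of inspection that produces output like the above, the snippet below rebuilds only the first five rows printed by `str()` (the full file is not reproduced here) and counts missing values per column. The name `wine_head` is my own placeholder, and the commented `read.csv()` call uses an assumed file name.

```r
# Assumed loading step (file name is a guess, not taken from the report):
# wine <- read.csv("wineQualityWhites.csv")

# First five rows, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.05, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH                   = c(3, 3.3, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.4, 0.4),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6, 6, 6, 6, 6)  # factor code 4 maps to level "6"
)

# Structure of the data frame: one line per column with its type.
str(wine_head)

# Number of missing values in each column, and in total.
na_counts <- colSums(is.na(wine_head))
total_na  <- sum(na_counts)
print(na_counts)
```

On the full data set the same two calls (`str()` and `colSums(is.na(...))`) would be applied to the whole data frame instead of this five-row excerpt.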

Some things to note:

1 - fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides (sodium chloride - g / dm^3): the amount of salt in the wine

6 - free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulfates (potassium sulfate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol (% by volume) : the percent alcohol content of the wine

12 - quality: ranking from experts on a scale of 0-10, where 0 is worst and 10 is best

Now let’s look at the descriptive statistics of the data set to get some more information on what we’re dealing with.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##  quality 
##  3:  20  
##  4: 163  
##  5:1457  
##  6:2198  
##  7: 880  
##  8: 175  
##  9:   5

Things to note:

## [1] 1
## [1] 868
## [1] 18

We can see that there is only 1 sweet white wine (residual sugar above 45 g/L), that 868 wines have a free SO2 level greater than 50 ppm, and that 18 wines have a volatile acidity level greater than 0.7 g/L. Let’s save these wines in a table for further analysis at a later point and keep them in mind.

I made a data frame above (odd_wines) to keep track of wines that have features that fall outside normal levels which correlate with a detectable change in taste, according to the info provided with the data set. I’m going to make a general analysis first and see what I can find by looking at the data as a whole, and then follow it up by looking at wines with “detectable” features and see if the features that they have correlate with higher or lower quality ratings.
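
The counting and subsetting described above can be sketched as follows. This is only an illustration under assumptions: the four rows in `toy` are made up for the example (they are not rows from the data set), and the way `odd_wines` is assembled here (any wine exceeding at least one threshold) is my guess at the construction, not necessarily the original code.

```r
# Illustrative rows only (NOT taken from the data set).
toy <- data.frame(
  residual.sugar      = c(1.6, 65.8, 8.5, 5.2),   # g/dm^3
  free.sulfur.dioxide = c(14, 30, 289, 34),       # mg/dm^3 (ppm)
  volatile.acidity    = c(0.30, 0.28, 0.23, 1.10) # g/dm^3
)

# Thresholds quoted in the variable descriptions.
sweet_limit    <- 45   # residual sugar above this is considered sweet
free_so2_limit <- 50   # free SO2 above this becomes detectable
va_limit       <- 0.7  # volatile acidity above this is treated as detectable

# Logical flags, one per wine and threshold.
is_sweet    <- toy$residual.sugar > sweet_limit
is_high_so2 <- toy$free.sulfur.dioxide > free_so2_limit
is_high_va  <- toy$volatile.acidity > va_limit

# Counts analogous to the three numbers printed above.
n_sweet    <- sum(is_sweet)
n_high_so2 <- sum(is_high_so2)
n_high_va  <- sum(is_high_va)

# Keep every wine that exceeds at least one threshold.
odd_wines <- toy[is_sweet | is_high_so2 | is_high_va, ]
print(odd_wines)
```

Keeping the flags as separate logical vectors makes it easy to later ask which threshold a given wine in `odd_wines` exceeded.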

For now, let’s take a look at the distributions of the data.

Things to note:

Now let’s look at the distribution of alcohol content among the wines.

Things to note:

Residual sugar is under 10.0 g/dm^3 for the majority of the wines in the dataset.

The majority of wines have chloride concentrations between 0.025 and 0.05 g/dm^3.

Sulphate content tends to fall between 0.35 and 0.55 g/dm^3 and looks roughly normally distributed.

Most wines have a total SO2 content between 100 and 200 mg/dm^3 (ppm).

Most wines have between 0.25 and 0.5 g/dm^3 of citric acid.

Most wines have a density ranging from 0.990 to 0.9975 g/cm^3.

Most wines have between 20 and 50 mg/dm^3 of free sulfur dioxide, which is under the limit at which it becomes detectable by taste.

Most wines have between 0.2 and 0.3 g/dm^3 of volatile acidity, which is well under the detectable-by-taste limit of 0.7 g/dm^3 (g/L).

Fixed acidity has a distribution similar to that of volatile acidity, but on a different scale: fixed acidity concentrations are all greater than 1 g/dm^3.

Now one last look at another variable, pH:

Things to note:

Univariate Analysis

What is the structure of your dataset?

This data set consists of 4898 observations of wines with 12 features being measured:

1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3)

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulfates (potassium sulfate - g / dm^3)

11 - alcohol (% by volume)

12 - quality (score between 0 and 10, worst to best respectively)

What is/are the main feature(s) of interest in your dataset?

Going forward, I’d like to see what changes if any are correlated with a higher rating from critics. Based on the above analysis, we can see that there are a couple of factors that may be responsible for a different rating. Because rating is based on a subjective scale of taste, you may be able to see what wines ranked better or worse on average based on factors that may give the wines a different taste, such as residual sugar levels, alcohol by volume, free sulfur dioxide content, or volatile acidity levels.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think if we also look at some features such as chloride and citric acid content, we can see if taste is affected by adding different levels of chemicals to the wines.

Did you create any new variables from existing variables in the dataset?

I created a few variables to add to the data set, one being total acidity which is simply the sum of volatile and fixed acidity.

The second variable I created was the bound sulfur dioxide variable, which is the difference between the total sulfur dioxide content and the free sulfur dioxide content. It approximates the SO2 that is bound to other compounds in the wine rather than present in free form.

The third variable I made is the sum of the additives in g/dm^3, which consist of citric acid, sulfates, and chlorides. This is a feature that will be of interest in the future since these are described to have an effect on flavor and body of wines and may have a correlation with quality ranking.
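
A minimal sketch of the three derived variables, computed on the first five rows shown in the `str()` output. The column name `total.acidity` appears in the correlation output later in this report; `bound.sulfur.dioxide` and `additives` are names I am assuming for illustration.

```r
# First five rows of the relevant columns, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32),
  chlorides            = c(0.045, 0.049, 0.05, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  sulphates            = c(0.45, 0.49, 0.44, 0.4, 0.4)
)

# Total acidity: fixed plus volatile acidity (g/dm^3).
wine_head$total.acidity <- with(wine_head, fixed.acidity + volatile.acidity)

# Bound SO2: total minus free sulfur dioxide (mg/dm^3). Assumed column name.
wine_head$bound.sulfur.dioxide <-
  with(wine_head, total.sulfur.dioxide - free.sulfur.dioxide)

# Additives: citric acid + sulphates + chlorides (g/dm^3). Assumed column name.
wine_head$additives <- with(wine_head, citric.acid + sulphates + chlorides)

print(wine_head[, c("total.acidity", "bound.sulfur.dioxide", "additives")])
```

Note that bound SO2 is in mg/dm^3 while the other two sums are in g/dm^3, so the three new columns should not be added together without converting units.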

Any unusual distributions or any data cleaning done?

There were no unusual distributions in the data observed above, and there were only a few minor changes I had to do to the data to clean it up since this data set was a tidy data set provided by Udacity. I just dropped a column that listed the wines in numerical order, and made the quality variable an ordered factor since it is a metric of rank for the data set.
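
The two cleaning steps can be sketched like this. The index column name `X` is an assumption (it is what `read.csv()` commonly produces for an unnamed first column), and the three-column `raw` data frame is a cut-down stand-in for the full file.

```r
# Stand-in for the raw file: an index column plus two real columns.
raw <- data.frame(
  X       = 1:5,                          # hypothetical name of the row-number column
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),  # first five values from str()
  quality = c(6, 6, 6, 6, 6)
)

wine <- raw

# Step 1: drop the column that only numbers the wines.
wine$X <- NULL

# Step 2: make quality an ordered factor. Levels 3..9 are the scores that
# actually occur in the summary output; ordering lets R treat 3 < 4 < ... < 9.
wine$quality <- factor(wine$quality, levels = 3:9, ordered = TRUE)

str(wine)
```

Fixing the levels explicitly keeps all seven quality scores available even when a subset of the data happens to contain only some of them.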

Bivariate Plots Section

Let’s look at some bivariate plots to see what insights we can gather from the data.

Let’s get a quick summary of the data set by displaying a matrix of plots using ggpairs to see the relationships that the features have with one another.

The column labels 1-12 correspond to the following variables:

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulfates

11 - alcohol

12 - quality

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

One thing that is interesting about the plot above is that alcohol content appears to be positively correlated with quality: wines with a higher alcohol content are rated a bit higher on the quality scale. Looking at the summary statistics, we can see that the median and mean alcohol content increase with quality for wines with a quality ranking of 5 and up. The trend doesn’t hold at the lower end of the quality spectrum (3 and 4), so there may be other underlying factors that counteract the positive association that alcohol content seems to have with quality.
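
To make the "increases from 5 and up" observation checkable, here is a small sketch that uses only the per-quality medians and means printed above. Output of that shape is what `by(wine$alcohol, wine$quality, summary)` prints, though that is my reading of the output rather than something stated in the report.

```r
# Median and mean alcohol (% by volume) per quality level, copied from the
# summary output above.
alcohol_median <- c(`3` = 10.45, `4` = 10.10, `5` = 9.50, `6` = 10.50,
                    `7` = 11.40, `8` = 12.00, `9` = 12.50)
alcohol_mean   <- c(`3` = 10.35, `4` = 10.15, `5` = 9.809, `6` = 10.58,
                    `7` = 11.37, `8` = 11.64, `9` = 12.18)

# Change in the median from one quality level to the next.
steps <- diff(alcohol_median)
print(steps)

# Does alcohol rise with every step across the whole quality range?
rising_everywhere <- all(steps > 0)

# Does it rise with every step when we start at quality 5?
upper <- c("5", "6", "7", "8", "9")
median_rising_from_5 <- all(diff(alcohol_median[upper]) > 0)
mean_rising_from_5   <- all(diff(alcohol_mean[upper]) > 0)

print(c(rising_everywhere, median_rising_from_5, mean_rising_from_5))
```

The first check fails because of the dip between qualities 3 and 5, while both checks restricted to 5 and above pass, which matches the description in the text.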

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.587   4.600   6.393  10.700  16.200 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

Based on the above visualization, we can see that there doesn’t seem to be a major difference in residual sugar content across different qualities of wines. One would assume that sweetness would affect the quality ranking of a wine since taste is taken into account when ranking the wines. However, there doesn’t seem to be any pattern here that we can use. We’ll revisit residual sugar later in our multivariate analysis to see if there might be a link to another feature that can be missed here.

Now let’s look at free sulfur dioxide.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   13.25   33.50   53.33   47.50  289.00 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   18.00   23.36   30.50  138.50 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   22.00   35.00   36.43   50.00  131.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   24.00   34.00   35.65   46.00  112.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   33.00   34.13   41.00  108.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   28.00   35.00   36.72   44.50  105.00 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    24.0    27.0    28.0    33.4    31.0    57.0

Looking at the above plot and summary statistics, there doesn’t seem to be a big difference in free SO2 content between wines of different quality. As we stated above, SO2 above 50 ppm is detectable in wines and could affect the taste. However, if we look at the median SO2 content for each quality of wine, we can see that none of them are over that threshold. There are some outliers that are well above 50 ppm, but there doesn’t seem to be a significant effect on the overall quality of the wines. Perhaps free SO2 content doesn’t affect taste as much as anticipated based on the information provided with the data set. We’ll explore this as well in the future multivariate analysis portion.

Now let’s look at pH and whether the acidity of the wine is related to the quality ranking.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.035   3.215   3.188   3.325   3.550 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.070   3.160   3.183   3.280   3.720 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.120   3.230   3.219   3.330   3.590 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   3.280   3.280   3.308   3.370   3.410

Based on the above graph, there does seem to be a very slight difference in pH across quality. Following the summary statistics, the mean and median pH levels increase with quality ranking from a quality of 5 upward. This suggests that slightly less acidic wines tend to be rated higher, although all of the wines remain clearly acidic. One thing to note is that pH depends on several factors that the data set provides, such as total acidity (fixed/volatile acidity), citric acid content, alcohol content, and sulfur dioxide content. This will be an interesting feature to look at in our multivariate analysis because of this dependence on a wine’s composition.
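
Because pH is a logarithmic scale, a small pH difference can hide a larger difference in acidity. The sketch below converts the median pH of quality 5 and quality 9 wines (values from the output above) to approximate hydrogen ion concentrations, using the textbook simplification that pH equals the negative base-10 logarithm of the concentration. The helper name `h_conc` is my own.

```r
# Median pH per quality level, copied from the summary output above.
ph_median_q5 <- 3.16
ph_median_q9 <- 3.28

# Approximate hydrogen ion concentration (mol/L) for a given pH,
# assuming pH = -log10([H+]).
h_conc <- function(ph) 10^(-ph)

# Ratio of concentrations: quality 9 relative to quality 5.
ratio <- h_conc(ph_median_q9) / h_conc(ph_median_q5)

# Percentage by which the concentration is lower at the higher pH.
percent_lower <- (1 - ratio) * 100

print(round(c(ratio = ratio, percent_lower = percent_lower), 3))
```

Under this simplification, the 0.12 difference in median pH corresponds to a hydrogen ion concentration that is roughly a quarter lower for the quality 9 median than for the quality 5 median.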

Let’s look at the total acidity feature that we created earlier and see how it’s related to quality.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.415   6.820   7.705   7.933   8.857  12.030 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.450   6.745   7.310   7.511   7.920  10.910 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.690   6.660   7.140   7.236   7.730  10.550 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.110   6.550   7.030   7.098   7.567  14.470 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.370   6.505   6.980   6.997   7.460   9.450 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.125   6.475   7.040   6.935   7.490   8.570 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.960   7.260   7.360   7.718   7.640   9.370

According to the plot, there doesn’t seem to be a major trend between total acidity and quality. There does appear to be noticeably more variability in the total acidity of wines with a quality ranking of 3 than in other wines. Regardless, this doesn’t seem to be a very useful metric for now and we’ll have to explore it later.

Now let’s look at additives, which is the sum of citric acid, chlorides, and sulfates.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6370  0.7812  0.8235  0.8648  0.9325  1.4440 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3740  0.6670  0.8170  0.8305  0.9835  1.4420 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3440  0.7420  0.8600  0.8714  0.9830  1.6540 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.7540  0.8550  0.8743  0.9630  2.2320 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3870  0.7538  0.8440  0.8669  0.9555  1.4170 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4060  0.7370  0.8160  0.8511  0.9405  1.3910 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7180  0.8710  0.9210  0.8794  0.9420  0.9450

There doesn’t seem to be a trend between the different quality wines based on the total amount of additives to the wine. They all tend to have amounts less than 0.9 g/dm^3 of total additives, and the mean and medians are all somewhat close to one another. Perhaps the sum of the additives isn’t as impactful as we would’ve thought.

One last feature that I wanted to look at was bound sulfur dioxide content and how it differs across quality

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    14.0    82.5   106.0   117.3   152.2   331.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    67.5   102.0   101.9   133.8   195.0 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    91.0   114.0   114.5   137.0   293.5 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.0    76.0    97.0   101.4   123.0   243.0 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   71.00   86.00   90.99  106.00  199.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42.00   71.00   84.00   89.45  104.50  159.50 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0    62.0    82.0    82.6    96.0   112.0

There seems to be a slight trend with bound sulfur dioxide and quality, where mean and median bound sulfur dioxide content decreases across quality.

Out of pure interest, let’s look at some scatter plots to see if our data set is consistent with real-world chemistry concepts. The first features that I’ll be looking at are density and alcohol content. The density of a fluid is its mass divided by its volume. The density of water is typically used as a relative standard, at 1 g/cm^3. According to Wikipedia, the density of ethanol is 0.7893 g/cm^3, which means that as you add more ethanol to water, the overall density should decrease. This should be reflected in our data: as alcohol by volume increases in a wine, its density should decrease. Let’s see if this trend is present in our data:

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

As we can see above, you do see a negative correlation between alcohol content and density of a wine. Again, this is consistent with principles of physical chemistry, where an overall decrease in density should be observed with an increase in alcohol content.
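
The expected direction of this relationship can be illustrated with a back-of-the-envelope calculation. The sketch below treats wine as an ideal mixture of only water and ethanol whose volumes simply add up. That is a deliberate simplification: real ethanol-water mixtures contract slightly, and wine also contains sugar and other dissolved solids, so these numbers are not expected to match the measured densities. The function name `ideal_density` is my own.

```r
rho_water   <- 1.0     # g/cm^3, reference value used in the text
rho_ethanol <- 0.7893  # g/cm^3, value quoted in the text

# Density of an idealized water/ethanol mixture for a given alcohol
# content in percent by volume (volume-weighted average of the densities).
ideal_density <- function(abv_percent) {
  frac <- abv_percent / 100
  frac * rho_ethanol + (1 - frac) * rho_water
}

# Minimum, median, and maximum alcohol content from the summary table.
abv <- c(min = 8.0, median = 10.4, max = 14.2)
est <- ideal_density(abv)

print(round(est, 4))
```

The estimate falls as alcohol rises, which is the same direction as the negative correlation reported above. The idealized values sit below most of the measured densities (0.9871 to 1.0390 in the summary table), which is at least consistent with dissolved sugar and other solids pushing the real density back up.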

Let’s look at one more set of features and see if they’re consistent with real-world science. In chemistry, pH is a measure of the basicity or acidity of a solution on a scale of 0-14, where 0 is strongly acidic, 7 is neutral (pure water), and 14 is strongly basic. Because of this, it would make sense that an increase in acid content should cause a decrease in the pH of a fluid. In the context of our data, an increase in total acidity should be correlated with a decrease in the pH of a wine. Let’s see if this property of physical chemistry is consistent with our data.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.acidity and wine$pH
## t = -33.116, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4503932 -0.4046240
## sample estimates:
##        cor 
## -0.4277827

As we can see above there is a negative trend in the data, where a decrease in total acidity is correlated with an increase in pH. This is consistent with the principles of physical chemistry, although the correlation reported by cor.test() (about -0.43) is weaker than one might expect. pH can also be affected by other factors such as alcohol content and sulfur dioxide, which may be why the correlation is only moderate. This will be explored later in the multivariate analysis portion.
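
To put the two correlations side by side, the sketch below recovers each Pearson coefficient from the t statistic and degrees of freedom printed by `cor.test()`, using the standard identity r = t / sqrt(t^2 + df), and then squares it to get the share of variance explained by a straight-line fit. The helper name `r_from_t` is my own.

```r
# Recover Pearson's r from the t statistic and degrees of freedom
# reported by cor.test().
r_from_t <- function(t, df) t / sqrt(t^2 + df)

# Values copied from the two cor.test() outputs above.
r_alcohol_density <- r_from_t(-87.255, 4896)
r_acidity_ph      <- r_from_t(-33.116, 4896)

# Squared correlation: fraction of variance explained by a linear fit.
r_squared <- c(alcohol_density = r_alcohol_density^2,
               acidity_ph      = r_acidity_ph^2)

print(round(c(r_alcohol_density, r_acidity_ph), 4))
print(round(r_squared, 3))
```

Read this way, alcohol content accounts for about 61% of the variance in density, while total acidity accounts for only about 18% of the variance in pH, which fits the idea that other components of the wine also influence pH.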

Let’s look at some other scatter plots that focus on the relationships of some features of interest with either density or alcohol content to see if there are any other findings that coincide with science.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.acidity and wine$alcohol
## t = -7.9076, df = 4896, p-value = 3.218e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.13986371 -0.08455661
## sample estimates:
##        cor 
## -0.1122971
## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$alcohol
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4709775 -0.4262443
## sample estimates:
##        cor 
## -0.4488921
## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312
## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$total.acidity
## t = 19.417, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2411894 0.2932005
## sample estimates:
##       cor 
## 0.2673897
## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$total.sulfur.dioxide
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5094349 0.5497297
## sample estimates:
##       cor 
## 0.5298813
## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

A lot to cover here, but let’s start with alcohol content. First off, we can see that there really isn’t much correlation with total acidity content (-0.112). However, there is a moderate negative correlation between alcohol content and both total sulfur dioxide and residual sugar content. The correlation between alcohol content and total SO2 is -0.449, and the correlation between alcohol and residual sugar is -0.451. Knowing how fermentation works, it makes sense that residual sugar should decrease with an increase in alcohol content, since sugars in wine are converted to alcohol by yeast during the fermentation process. Therefore the less sugar left in the wine, the more alcohol should be present after fermentation. The relationship between total SO2 and alcohol is a bit trickier, since information on their relationship isn’t as readily available. Regardless, it is something to look at in the future when thinking about how a wine’s alcohol content and its relationship to other features go into the quality ranking of a wine.

Now let’s discuss the relationship that density has with the same variables that we discussed above. Looking at the plots, there is only a weak positive correlation with total acidity (0.267). We can see a much stronger relationship when looking at total sulfur dioxide (0.530) and residual sugar (0.839). Because total sulfur dioxide includes free SO2 gas (0.0026 g/cm^3) and its salt form bisulfite (1.48 g/cm^3), the addition of these components to a wine seems to result in an overall increase in density. The same holds for sugar (1.50 - 1.65 g/cm^3), assuming that adding a denser component to wine results in an increase in overall density.
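
For a compact comparison, the six correlation estimates printed above can be collected into one small table and ranked by strength. This is only a convenience sketch; the data frame name `strength` and its column names are my own.

```r
# Pearson correlations copied from the cor.test() outputs above.
r_alcohol <- c(total.acidity        = -0.1122971,
               total.sulfur.dioxide = -0.4488921,
               residual.sugar       = -0.4506312)
r_density <- c(total.acidity        = 0.2673897,
               total.sulfur.dioxide = 0.5298813,
               residual.sugar       = 0.8389665)

# One row per feature, with the squared correlation against density.
strength <- data.frame(
  feature    = names(r_density),
  r_alcohol  = unname(r_alcohol),
  r_density  = unname(r_density),
  r2_density = unname(r_density)^2
)

# Rank features by how strongly they correlate with density.
strength <- strength[order(-abs(strength$r_density)), ]
rownames(strength) <- NULL
print(strength)
```

Residual sugar comes out on top for density, with a squared correlation of about 0.70, while the same feature and total SO2 have nearly identical correlations with alcohol.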

As we mentioned before, we singled out some outlier wines and put them in another data frame, odd_wines. Let’s look at some of the plots that we explored before when looking at the entirety of the data.

Getting a closer look at some of the more extreme wines of the data set helps to gather a few more insights into what makes a quality wine. The box plots are very similar to the earlier ones that were based on the data set as a whole. One thing that we can gather is that there is a lot of variance in wines with a quality of 3 with respect to the features that we looked at above. They tend to be more basic, tend to have more free SO2, and have a low residual sugar content. These factors could explain why those wines rank so low on taste, since they should be detectable when drinking the wine. As for wines on the higher end, there’s not too much to see other than what we saw before, where median alcohol content increases with quality. What’s interesting to note is that although these wines have one or more features that are high enough in quantity that they should be detectable and affect taste, they mirror the same overall trend in quality seen when looking at all the wines.

Bivariate Analysis

Some things to look at moving forward:

Multivariate Plots Section

Now let’s do some multivariate analysis to help us see if some of the dead ends we saw before are really dead ends. First off, let’s look at alcohol’s relationship with other features with respect to quality rating.

Now let’s look at the ratio of acidity to alcohol by volume and also the ratio of total SO2 to alcohol with respect to pH and see what we find.
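
A minimal sketch of the two ratios, computed on the first five rows from the `str()` output. I am assuming that "acidity" here means the total acidity variable created earlier; the column names `acid.per.alcohol` and `so2.per.alcohol` are placeholders of my own.

```r
# First five rows of the relevant columns, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  pH                   = c(3, 3.3, 3.26, 3.19, 3.19)
)

# Total acidity as defined earlier: fixed plus volatile acidity.
wine_head$total.acidity <- with(wine_head, fixed.acidity + volatile.acidity)

# Ratio of total acidity (g/dm^3) to alcohol (% by volume).
wine_head$acid.per.alcohol <- with(wine_head, total.acidity / alcohol)

# Ratio of total SO2 (mg/dm^3) to alcohol (% by volume).
wine_head$so2.per.alcohol <- with(wine_head, total.sulfur.dioxide / alcohol)

print(round(wine_head[, c("pH", "acid.per.alcohol", "so2.per.alcohol")], 3))
```

These ratio columns can then be plotted against pH in the same way as any other numeric feature.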

One last set of plots, which will look at the relationship between additives and alcohol, colored by quality ranking. The three plots will look at the entirety of the data set, a subset of the worst rated wines (3,4), and a subset of the best rated wines (8,9).
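
The two subsets mentioned above can be sketched as follows. The rows in `toy` are made up purely to show the subsetting (they are not wines from the data set), and `worst` and `best` are names I am assuming.

```r
# Illustrative rows only (NOT taken from the data set).
toy <- data.frame(
  alcohol   = c(9.4, 10.1, 11.0, 12.0, 12.5, 10.5),
  additives = c(0.82, 0.86, 0.85, 0.82, 0.92, 0.87),
  quality   = factor(c(3, 4, 6, 8, 9, 5), levels = 3:9, ordered = TRUE)
)

# Worst rated wines: quality 3 or 4.
worst <- subset(toy, quality %in% c("3", "4"))

# Best rated wines: quality 8 or 9.
best <- subset(toy, quality %in% c("8", "9"))

print(worst)
print(best)
```

Each subset keeps the full set of columns, so the same additives-versus-alcohol plot can be drawn for the whole data set and for each subset without changing the plotting code.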

Multivariate Analysis

Things to note:

As seen before, residual sugar decreases as alcohol increases but stays near 6 g/dm^3 for higher rated wines.

Looking at the sum of additives in wine doesn’t seem to have a major impact on the overall quality of a wine with respect to alcohol. The additive content is pretty similar across the board for all quality types. At most, one could tentatively say that wines with a total additive concentration under 1 g/dm^3 are more likely to rank higher, but the evidence for this is weak.

Final Plots and Summary

First off, I wanted to show the distribution of quality in the data set. Quality seems to have a roughly normal distribution, where we can see that the majority of the wines are rated 5 or 6 and that there are no wines that are absolutely terrible (1 or 2) or a perfect 10. In fact, there are very few wines ranked above 8 or below 5, so one would assume that you have to do something extraordinary to land on the extreme ends of the ranking spectrum. This led to the question of which features are most strongly associated with quality. Further analysis of the features was done to show the relationship between them and quality.

When exploring the data, I found that alcohol content was the strongest indicator of wine quality, where the median alcohol percentage increased with an increase in wine quality ranking. You can also see the mean alcohol content increase with quality (as denoted by the red x). Although the median alcohol content for wines with a 3 or 4 ranking is similar to that of wines with a 6 ranking, the general trend is that an increase in ABV is correlated with an increase in quality. The exception at the low end may be due to other factors that affect taste, like sugar content and other additives used to preserve flavor or add a certain property to the wine.
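The per-quality medians and means described above can be computed with `tapply()`. This is only a sketch on a small toy data frame whose values are invented for illustration; the names `toy` and `alcohol_summary` are hypothetical, and the rank correlation at the end is an extra check I added, not something taken from the plots.

```r
# Toy data: values invented for illustration, not taken from the study.
toy <- data.frame(
  alcohol = c(9.2, 9.5, 10.1, 10.0, 10.4, 11.3, 11.0, 11.4, 12.4),
  quality = factor(c(5, 5, 5, 6, 6, 6, 7, 7, 7))
)

# Median and mean alcohol within each quality level.
med <- tapply(toy$alcohol, toy$quality, median)
avg <- tapply(toy$alcohol, toy$quality, mean)

alcohol_summary <- data.frame(
  quality        = names(med),
  median.alcohol = as.vector(med),
  mean.alcohol   = as.vector(avg)
)
print(alcohol_summary)

# quality is a factor, so go through as.character() to recover the ratings
# before computing a rank correlation with alcohol.
quality_num <- as.numeric(as.character(toy$quality))
rho <- cor(quality_num, toy$alcohol, method = "spearman")
print(rho)
```

A rank-based correlation is used here because quality is an ordinal score, so the spacing between ratings shouldn’t be treated as exact.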

Finally, the last series of plots shows the relationship alcohol has with different features that are associated with changes in the taste of a wine. We can see above that the extreme ends of wine quality show some trends, and these may be the keys to making a great wine or avoiding the creation of a terrible one. First off, we can see that residual sugar decreases as alcohol increases. However, we now see that higher rated wines (8, 9) fall within a residual sugar range of about 4 to 7 g/dm^3, while low rated wines (3, 4) tend to be below 5 g/dm^3 and have an ABV of less than 11%.

We can also see a relationship with free sulfur dioxide, where higher rated wines are closer to 50 ppm. As previously discussed, when free SO2 crosses that threshold of 50 ppm, it leaves a detectable quality that can affect the taste of the wine. Although it serves a purpose in wine production, too much or too little seems to correspond with a decrease in quality.

Volatile acidity seems to impact wines with lower alcohol content, since wines with lower alcohol content and average volatile acidity levels rank low on the quality scale. If you look down the alcohol scale, you’ll see that volatile acidity levels don’t really change much and don’t seem to impact the rating.

Additives, which include citric acid, chlorides, and sulfates, don’t appear to have a big impact on wine quality rating. These factors are related to different tastes since they can affect chemical properties of the wine such as pH or density, so it’s interesting that an aggregate metric of these factors doesn’t really show a trend. When looking at them individually, there wasn’t much of a trend there either, so perhaps this is simply a dead end. One thing to note is that higher rated wines appear to have a total additive concentration between 0.8 and 0.9 g/dm^3, so perhaps that’s something to take into account during wine production.

Reflection

There are a couple of things to reflect on after this lengthy analysis. According to our findings, there aren’t a whole lot of qualities that affect how a wine is perceived to taste on their own. Rather, it’s a combination of factors that, when present in the right amounts, can lead to a higher quality ranking in terms of taste. I found that the alcohol content of a wine, in combination with concentration ranges of certain additives and preservatives, is more likely to indicate whether or not a wine will taste great.

Despite the fact that there do seem to be trends that bad and good wines follow, the wines don’t necessarily fit a linear regression model, since their features don’t really have much of a linear relationship with one another. Instead, a modeling technique such as random forest classification or K-nearest neighbors would be a better choice for further studies and for building a model to predict where a wine will fall in the ranking.
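To give a feel for what the K-nearest neighbors option might look like, here is a minimal sketch using `knn()` from the `class` package, which ships with standard R installations. No such model was fit in this analysis. The data are synthetic (randomly generated with a fixed seed), the three quality labels are made up, and the choice of two features and `k = 5` is arbitrary.

```r
library(class)  # provides knn(); a recommended package bundled with R

set.seed(42)
n <- 60

# Synthetic stand-in for the wine data: invented values and invented labels.
toy <- data.frame(
  alcohol        = c(rnorm(n, 9.5, 0.5), rnorm(n, 10.5, 0.5), rnorm(n, 12, 0.5)),
  residual.sugar = c(rnorm(n, 3, 1), rnorm(n, 6, 1.5), rnorm(n, 5.5, 1)),
  quality        = factor(rep(c("low", "mid", "high"), each = n),
                          levels = c("low", "mid", "high"))
)

# Hold out 30% of the rows for testing.
feats     <- c("alcohol", "residual.sugar")
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train     <- toy[train_idx, feats]
test      <- toy[-train_idx, feats]

# KNN is distance based, so put the features on a common scale,
# using the training set's mean and standard deviation for both sets.
ctr     <- colMeans(train)
sds     <- apply(train, 2, sd)
train_s <- scale(train, center = ctr, scale = sds)
test_s  <- scale(test, center = ctr, scale = sds)

pred <- knn(train = train_s, test = test_s,
            cl = toy$quality[train_idx], k = 5)

# Confusion table and overall accuracy on the held-out rows.
print(table(predicted = pred, actual = toy$quality[-train_idx]))
accuracy <- mean(pred == toy$quality[-train_idx])
print(accuracy)
```

On the real data the label would be the seven-level `quality` factor, and with so few wines rated 3 or 9 the rare classes would likely need extra care.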

Although we explored a lot of the data, there is still much more to look at. However, it may be more useful to analyze different features of a wine than ones such as density, for example dates of production, location of production, and information on the critics. Including more data from wines with poor and high quality rankings would also help.

Ultimately, trying to predict people’s perception of wine is tricky. As we saw before, there is no one clear defining variable that makes or breaks a wine. This can be attributed to the fact that it’s difficult to quantify something as subjective as taste, where one person’s taste can differ so much from another person’s. This is a good start at seeing whether you can predict what makes a wine good in terms of taste, but I believe that you’ll always have a problem trying to make assumptions from such a subjective variable.